5  Genomics and Epigenomics Analysis

5.1 CNVs

Import CNVs

5.1.1 Filtering & QC

CNVs collapsed at the gene level available for 24776 genes

How many are all zeros??

not a single one with all zeros :/

cutoff density = 5 because overwelming abundances of 0s

There is obviously erinchment in CNVs in KICH when compared to KIRC and KIRP. KIRP has as well an higher number of CNVs than KIRP.

Figure 1: CNVs distributions

5.1.2 CNVs signatures across Kidney cancers

Followed guidelines and methodlogies explored by Alexandrov et al., 2022 and Steele et al., 2022

We generates novel signatures with SigProfiler

Need to reformat as::

The CNV signatures you have will be binned based on copy number and segment size. It looks like the data you have is in the format:

Copy Number>:

Copy number = The number of alleles in the CNV Label = Name for the copy number event: Homdel = Homozygous deletion. This is CNA=0 LOH = Loss of heterozygousity. One of the parental alleles has been lost. Het = Heterzygous. Both parental alleles are present. Segment size = Size of the CNA segment.

Hope that helps!

TODO:

  • add genomic ranges to genes
  • merge consecutive genes spans with same CNV
  • plot them on CIRCOs

Figure 1: CNVs distributions

5.1.3 Dimesionality Reduction and Dataset Exploration

5.1.3.1 UMAP on filtered transcriptomics data

After we excluded biopsys from normal tissues and other tumors and we filtered out the lowly expressed genes, the UMAP shows three clusters that are better refined than the ones depicted in Section 1.2.1, and that roughly correspondes with three different kidney cancer subtypes.

Figure 2: Kidney UMAP

5.1.3.2 PCA

Figure 3: PCA, exploration

The first 24 Principal Components capture more than 80% of the variance in the Kidney cancers transcriptomics dataset, with the first two components (PC1 and PC2) capturing a bit more than 25% of the variance.

When we project the samples in the PC1 and PC2, we can see that the PC1 separates KIRC from KICH adn KIRP, which instead cluster together. The second component PC2, instead, seems to partially separate KICH and KIRP samples.

Figure 4: PCA, biplot PC1 x PC2

We can also investigate other dimensions Principal Components, to see if there is a component that manages to fully resove the three cancer types. PC4 seems to separates better the KICH from KIRP, while PC1 can discriminate between KIRC and KIRP.

Figure 5: PCA, pairs plots

Figure 5: PCA, correlations between clinical covariates and Principal Components

5.1.4 CNVs on Factor3 genes

Figure 1: CNVs distributions

5.1.4.1 Mutational signatures associated to Factor3

Figure 1: CNVs distributions

5.2 Variants

SOMATIC MUTATIONS

(https://tcga-pancan-atlas-hub.s3.us-east-1.amazonaws.com/download/mc3.v0.2.8.PUBLIC.nonsilentGene.xena.gz)

711 samples with somatic mutation info available for 40542 genes, 26958 without mutations, 13584 with mutation

Figure 1: CNVs distributions

5.2.1 Mutational signatures across Kidney cancers

Figure 1: CNVs distributions

WHAT IS IN CHROMOSOME 3????????

Using external_gene_name as id variables

5.2.2 Dimesionality Reduction and Dataset Exploration

5.2.2.1 UMAP on filtered transcriptomics data

After we excluded biopsys from normal tissues and other tumors and we filtered out the lowly expressed genes, the UMAP shows three clusters that are better refined than the ones depicted in Section 1.2.1, and that roughly correspondes with three different kidney cancer subtypes.

Figure 2: Kidney UMAP

5.2.2.2 PCA

Figure 3: PCA, exploration

The first 24 Principal Components capture more than 80% of the variance in the Kidney cancers transcriptomics dataset, with the first two components (PC1 and PC2) capturing a bit more than 25% of the variance.

When we project the samples in the PC1 and PC2, we can see that the PC1 separates KIRC from KICH adn KIRP, which instead cluster together. The second component PC2, instead, seems to partially separate KICH and KIRP samples.

We can also investigate other dimensions Principal Components, to see if there is a component that manages to fully resove the three cancer types. PC4 seems to separates better the KICH from KIRP, while PC1 can discriminate between KIRC and KIRP.

Figure 5: PCA, pairs plots

Figure 5: PCA, correlations between clinical covariates and Principal Components

5.2.3 Somatic Mutations on Factor3 genes

Figure 1: CNVs distributions

5.2.3.1 Mutational signatures associated to Factor3

Figure 1: CNVs distributions

5.3 Epigenomics

646 samples with methylation info 148068 whitelisted probes

Regardless of array type, both the 450k and EPIC record two measurements for each CpG: a methylated intensity (M) and an unmethylated intensity (U). Using these values, the proportion of methylation at each site CpG locus can be determined. The level of methylation at a locus is commonly reported as the Beta-value, i.e. the ratio of the methylated probe intensity and the overall intensity:

beta = M/(M/U)

Illumina recommends adding a constant offset α (by default, α = 100) to the denominator to regularize Beta value when both methylated and unmethylated probe intensities are low. The Beta-value statistic results in a number between 0 and 1, or 0 and 100%. Under ideal conditions, a value of zero indicates that all copies of the CpG site in the sample were completely unmethylated (no methylated molecules were measured) and a value of one indicates that every copy of the site was methylated.

5.3.1 Filtering & QC

removed 6364 probes with missing values

148068-141704 [1] 6364

Figure 1: CNVs distributions

5.3.2 Dimesionality Reduction and Dataset Exploration

5.3.2.1 UMAP on filtered transcriptomics data

After we excluded biopsys from normal tissues and other tumors and we filtered out the lowly expressed genes, the UMAP shows three clusters that are better refined than the ones depicted in Section 1.2.1, and that roughly correspondes with three different kidney cancer subtypes.

Figure 2: Kidney UMAP

5.3.2.2 PCA

Figure 3: PCA, exploration

The first 24 Principal Components capture more than 80% of the variance in the Kidney cancers transcriptomics dataset, with the first two components (PC1 and PC2) capturing a bit more than 25% of the variance.

When we project the samples in the PC1 and PC2, we can see that the PC1 separates KIRC from KICH adn KIRP, which instead cluster together. The second component PC2, instead, seems to partially separate KICH and KIRP samples.

Figure 4: PCA, biplot PC1 x PC2

We can also investigate other dimensions Principal Components, to see if there is a component that manages to fully resove the three cancer types. PC4 seems to separates better the KICH from KIRP, while PC1 can discriminate between KIRC and KIRP.

Figure 5: PCA, pairs plots

Figure 5: PCA, loadings

cg03830585 –> ITPR1 protects renal cancer cells against natural killer cells by inducing autophagy

cg00868875 –> KCTD1

cg07037412 –> CACNA1H Voltage-gated calcium channels: Novel targets for cancer therapy CACNA1H was downregulated in gastrointestinal stromal tumor, sarcoma and renal cancer Notably, compared with our previous research, CACNA1H was specifically overexpressed relative to normal tissue samples in renal cancer, sarcoma and gastrointestinal stromal tumors.

Figure 5: PCA, correlations between clinical covariates and Principal Components

5.3.3 Probe-Wise Differential Methylation

Figure 5: top 3 Differentially Methylated Probes per cancer type

print(summary(dt)) KIRC_vs_KICH KIRP_vs_KICH KIRC_vs_KIRP Down 56952 43222 69811 NotSig 45814 51748 42083 Up 45302 53098 36174

Figure 6: Volcano plots of each contrats reporting the genes differentially expressed

Figure 7: UpSet reporting the genes differentially methylated probes in common across all contrasts

Tables of differentially methylated probes.

ds

5.3.3.1 Functional Enrichments

Let’s see if any function is enriched in the genes of the Differentially Methilated Probes. Simply extract the methylated probes, compare agains the background ans see if the associated genes are diofferentialy methylated, and see the fucntions / pathways associated with them.

GO terms

KEGG pathways

GO terms

KEGG

s

ss

s

s

s

s

s

Let’s compare clusters

s

d

s s

Figure 17: dotplot of enriched KEGG pathways

Figure 17: dotplot of enriched KEGG pathways

Figure 17: dotplot of enriched KEGG pathways

x

Figure 20: dotplot of enriched Reactome pathways

Figure 20: dotplot of enriched Reactome pathways

Figure 20: dotplot of enriched Reactome pathways

s

x ssss

s

5.3.3.2 Aaggreagte probes into functional regions

Often, differential methylation of a single CpG is not so informative or can be hard to detect. Therefore, knowing whether several CpGs near to each other (or regions) are concordantly differentially methylated can be of greater interest.

There are several Bioconductor packages that have functions for identifying differentially methylated regions from 450k data. Some of the most popular are the dmrFind function in the charm package, which has been somewhat superseded for 450k arrays by the bumphunter function in minfi, and, the dmrcate in the DMRcate package. They are each based on different statistical methods, but we will be using dmrcate here, as it is based on limma and thus we can use the design and contrast matrix we defined earlier.

We will again start from our matrix of M-values. For this kind of analysis, this matrix has to be annotated with the chromosomal position of the CpGs and their gene annotations. Because in a first step the limma differential methylation analysis for single CpGs will be run again, we need to specify the design matrix, contrast matrix and contrast of interest.

Once we have the relevant statistics for the individual CpGs, we can then use the dmrcate function to combine them to identify differentially methylated regions. Of particular interest here is the lambda parameter; this value is the number of nucleotides that is allowed between significant CpGs before splitting them up in different regions. So a smaller lambda will result in more but smaller regions. For array data, the authors of the dmrcate package currently recommend a lambda of 1000. The main output table DMRs contains all of the regions found, along with their genomic annotations and p-values. To inspect this object and further visualization, you can best use the extractRanges function to create a GRanges object.

Differentially Methylated regions

KIRC_vs_KICH: 20346 KIRP_vs_KICH: 19243 KIRC_vs_KIRP: 20773

Just as for the single CpG analysis, it is a good idea to visually inspect the results to make sure they make sense. For this, use the DMR.plot function. By default, this plot draws the location of the DMR in the genome, the position of nearby genes, the positions of the CpG probes, the Beta value levels of each sample as a heatmap and the mean methylation levels for the various sample groups in the experiment.

5.3.3.3 Functional Regions mCSEA

An alternative approach to detect DMRs is to predefine the regions to be tested; so, as opposed to the previous approach where the regions are defined according to heuristic distance rules we can define regions based on a shared function. For this, we will used the package mCSEA which contains three types of regions for 450K and EPIC arrays: promoter regions, gene body and CpG Islands. mCSEA is based on Gene Set Enrichment analysis (GSEA), a popular methodology for functional analysis that was specifically designed to avoid some drawbacks in the field of gene expression. Briefly, CpG sites are ranked according to a metric (logFC, t-statistic, …) and an enrichment score (ES) is calculated for each region. This is done by running through the entire ranked CpG list, increasing the score when a CpG in the region is encountered and decreasing the score when the gene encountered is not in the region. A high ES indicates these probes are found high up in the ranked list. In other words, a high (N)ES value means that for the CpG sites in this region there is - on average - a shift towards a higher methylation level. This approach has been shown to be more effective to detect smaller but consistent methylation differences.

promoters

genes

85promoters, 150 genes

74 promoters 0 genes

promoters

genes

172 promoters, 50 genes

promoters

genes

cc

5.3.3.3.1 mCSEA Functional Enrichments

Let’s see if any function is enriched in the genes of the Differentially Methilated Probes. Simply extract the methylated probes, compare agains the background ans see if the associated genes are diofferentialy methylated, and see the fucntions / pathways associated with them.

d

Figure 8: dotplot of enriched GO Biological Process terms

Figure 8: dotplot of enriched GO Biological Process terms

Figure 8: dotplot of enriched GO Biological Process terms

Figure 17: dotplot of enriched KEGG pathways

Figure 17: dotplot of enriched KEGG pathways

Figure 20: dotplot of enriched Reactome pathways

Figure 20: dotplot of enriched Reactome pathways

Figure 20: dotplot of enriched Wikipathways pathways

c

5.4 Lessons Learnt

[[[[PROPER DESCRIPTION OF FINDINGS AND TRANSCRIPTOMICS ENRICHMENTS]]]]

So far, we have learnt:

  • A